feat: add E2B cloud sandbox environment by JunYeopLee · Pull Request #792 · SWE-agent/mini-swe-agent

JunYeopLee · 2026-03-24T10:40:52Z

Summary

This PR adds E2BEnvironment, a new environment backend that executes commands inside E2B cloud sandboxes.

Key design decisions

Automatic template management
The first time a Docker image is used, E2BTemplateManager converts it into a persistent E2B template via Template.build(). Subsequent runs reuse the cached template, so the build cost is paid only once per unique image.

Deterministic template naming
_image_to_template_name() maps Docker image names to stable, collision-resistant E2B template names using a sha256 8-character suffix. The result always stays within E2B's 63-character, alphanumeric-plus-hyphen limit.
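A minimal sketch of how such a mapping could work, not the PR's actual implementation. The `mswea` prefix, the lowercasing step, and the function name `image_to_template_name` are assumptions made for illustration; only the sha256 8-character suffix and the 63-character limit come from the description above.

```python
import hashlib
import re

MAX_TEMPLATE_NAME_LEN = 63  # limit stated in the PR description


def image_to_template_name(image: str, prefix: str = "mswea") -> str:
    """Hypothetical mapping from a Docker image name to a template name."""
    # The first 8 hex characters of sha256 keep names stable and make collisions unlikely.
    digest = hashlib.sha256(image.encode("utf-8")).hexdigest()[:8]
    # Replace every run of characters outside [a-z0-9] with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", image.lower()).strip("-")
    # Reserve room for the prefix, the digest, and the two joining hyphens.
    budget = MAX_TEMPLATE_NAME_LEN - len(prefix) - len(digest) - 2
    slug = slug[:budget].strip("-")
    return "-".join(part for part in (prefix, slug, digest) if part)
```

Because the suffix is derived from the full image name, two long image names that share their first characters still map to different template names after truncation.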

Thread-safe build timeout
Template builds are wrapped in a ThreadPoolExecutor (rather than signal.alarm) so that the timeout works correctly when invoked from worker threads — e.g., during parallel SWE-bench evaluation runs.
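The pattern can be sketched as follows. This is an illustration under assumptions, not the code from `e2b.py`: the helper name `run_with_timeout` is hypothetical. Python only allows signal handlers to be installed from the main thread, which is why a future-based timeout is the safer choice inside worker threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(fn, timeout_s: float, *args, **kwargs):
    """Run fn(*args, **kwargs) and raise TimeoutError if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"call did not finish within {timeout_s}s") from None
    finally:
        # Do not block on the worker thread; a timed-out call keeps running
        # in the background until it returns on its own.
        executor.shutdown(wait=False)
```

Note the trade-off: the timeout stops the caller from waiting, but it does not cancel work that has already started.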

SWE-bench integration
get_sb_environment() in swebench.py now injects the per-instance image for e2b the same way it already does for docker and swerex_modal.

Changes

| File | Change |
| --- | --- |
| `src/minisweagent/environments/extra/e2b.py` | New environment class (`E2BEnvironmentConfig`, `E2BTemplateManager`, `E2BEnvironment`) |
| `src/minisweagent/environments/__init__.py` | Register `"e2b"` key in the environment mapping |
| `src/minisweagent/run/benchmarks/swebench.py` | Inject Docker image for `e2b` environment class |
| `pyproject.toml` | Add `e2b` optional dependency (`e2b>=1.0.0`) |
| `tests/environments/extra/test_e2b.py` | 18 unit tests (all passing) |
| `docs/advanced/environments.md` | Add `e2b` entry to the environment list |
| `docs/reference/environments/e2b.md` | New reference page for `E2BEnvironment` |
| `README.md` | Mention E2B in the deployable environments list |

Usage

Install the extra and set the API key:

pip install "mini-swe-agent[e2b]"
export E2B_API_KEY="your-e2b-api-key"

Run SWE-bench evaluation via E2B:

mini-extra swebench \
    --subset verified \
    --split test \
    --workers 50 \
    --environment-class e2b

Or in a YAML config:

environment:
  environment_class: e2b
  sandbox_timeout: 3600
  cpu_count: 2
  memory_mb: 2048

Test plan

18 unit tests covering E2BEnvironmentConfig, _image_to_template_name, execute() (dict/string action, non-zero exit, exception, Submitted detection), serialize() (structure, credential exclusion), stop() (normal, missing sandbox, exception-tolerant)
Full test suite (485 passed, 33 skipped) passes without regressions

Add `E2BEnvironment`, a new environment backend that runs commands inside [E2B](https://e2b.dev) cloud sandboxes. Unlike the Docker and Modal backends, it requires no local Docker daemon: the sandbox runs entirely in the cloud.

Key design decisions:

- **Automatic template management**: The first time a Docker image is used, `E2BTemplateManager` converts it into a persistent E2B template via `Template.build()`. Subsequent runs reuse the cached template, so the build cost is paid only once per unique image.
- **Deterministic template naming**: `_image_to_template_name()` produces a stable, collision-resistant name (sha256 8-char suffix) that stays within E2B's 63-character, alphanumeric-plus-hyphen limit.
- **Thread-safe build timeout**: Template builds run in a `ThreadPoolExecutor` (not `signal.alarm`) so that the timeout works correctly when called from worker threads (e.g., parallel SWE-bench runs).
- **SWE-bench integration**: `get_sb_environment()` in `swebench.py` now injects the instance image for `e2b` the same way it does for `docker` and `swerex_modal`.

Changes:

- `src/minisweagent/environments/extra/e2b.py`: new environment class
- `src/minisweagent/environments/__init__.py`: register `"e2b"` key
- `src/minisweagent/run/benchmarks/swebench.py`: inject image for e2b
- `pyproject.toml`: add `e2b` optional dependency (`e2b>=1.0.0`)
- `tests/environments/extra/test_e2b.py`: 18 unit tests (all passing)
- `docs/`: update environments reference and README

for more information, see https://pre-commit.ci

Add a module-level _active_sandboxes set and an atexit handler (_cleanup_all_sandboxes) that kills all live sandboxes when the interpreter exits. This ensures sandboxes are cleaned up on Ctrl+C or unhandled exceptions where __del__ may not be reliably called.

- __init__ adds self to _active_sandboxes after sandbox creation
- stop() removes self from _active_sandboxes before calling sandbox.kill()
- atexit handler iterates over a snapshot of the set to avoid mutation issues

Two additional tests cover the registry and cleanup behaviour.

E2B_ACCESS_TOKEN / access_token is not recognised by the E2B SDK. Remove the config field, all call-site usages (Template.build, Sandbox.create), the serialize exclusion, and the corresponding tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

get_or_build() was short-circuiting to the else branch without consulting skip_cache, making force-rebuild impossible despite the field documenting "force-rebuild even if it already exists". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

klieret · 2026-03-24T17:04:54Z

Great! Happy to include that. I'm gonna look through things in detail soon.
Have you done some real test runs to confirm everything works (on a few SWE-bench instances with swebench.py for example)?

JunYeopLee · 2026-03-24T18:41:49Z

Hello @klieret, nice to meet you.

I’ve run some test executions using the setup below:

source .env
# OPENAI_API_KEY=
# E2B_API_KEY=

uv run mini-extra swebench \
    --model openai/gpt-5-nano \
    --split test \
    --workers 4 --environment-class e2b --output ./results

As shown in the screenshot, the execution is working as expected. I’ve also attached the result file for reference.

Please let me know if there are any additional checks or scenarios you’d like me to validate.

scikit-learn__scikit-learn-12471.traj.json

SWE-bench Docker images have /testbed owned by root, but E2B sandboxes run commands as user (UID 1000) by default, causing permission denied. Add user="root" to commands.run() to match Docker behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

klieret · 2026-03-26T16:04:51Z

Awesome, let me review in detail today!

klieret · 2026-03-26T16:06:55Z

(Also I sometimes announce new features/tag new contributors on twitter/linkedin, do you have an account that I should mention?)

JunYeopLee · 2026-03-26T21:43:51Z

@klieret Sounds great!

Here are my LinkedIn and X profiles:

https://www.linkedin.com/in/leejunyeop/
https://x.com/junyeoplee2

Thanks :)

When a cached template has been deleted on E2B's servers, Template.exists() still returns True but Sandbox.create() fails with a 404. This catches the error and triggers a rebuild automatically instead of requiring manual skip_cache=True.

Previously, sandbox resources were only cleaned up via atexit handler, which would not run if the process was forcefully terminated (e.g. double Ctrl+C). Now env.stop() is called in the finally block.

JunYeopLee · 2026-04-07T13:23:09Z

Hi @klieret, hope you're doing well.

I ran a full reproduction of the E2B environment on SWE-bench Verified (test split, 500 instances) using openai/gpt-5-mini.

Result: 56.40% (282 / 500) - consistent with the reference run on the dashboard.

Full evaluation report and predictions: https://gist.github.com/JunYeopLee/951e669863eb19be9a03b04b2b386fc6

Command

uv run mini-extra swebench \
      -m openai/gpt-5-mini \
      -c swebench.yaml \
      -c model.model_class=minisweagent.models.litellm_response_model.LitellmResponseModel \
      -c model.model_kwargs.temperature=null \
      -c model.model_kwargs.text.verbosity=medium \
      -c model.model_kwargs.reasoning.effort=medium \
      --environment-class e2b \
      -c environment.sandbox_timeout=7200 \
      --subset verified \
      --split test \
      --workers 40 \
      -o results_gpt_5_mini/

Please check :D

JunYeopLee · 2026-05-12T19:51:46Z

Hi @klieret , just wanted to kindly follow up on this PR.

I was wondering if there are any updates, or if there is anything else I can help with to move this forward.

Happy to make any changes or run additional checks if needed.

klieret · 2026-06-10T21:52:34Z

Hi @JunYeopLee so sorry for the late response, there was just so much going on. Let me look into things right now

Copilot

Pull request overview

Adds a new E2BEnvironment backend to run commands inside E2B cloud sandboxes (including template caching/building), wires it into SWE-bench runs, and documents/tests the new environment.

Changes:

Introduce E2BEnvironmentConfig, E2BTemplateManager, and E2BEnvironment for E2B-based execution (template build + cache).
Register the new "e2b" environment key and inject per-instance SWE-bench images when using e2b.
Add unit tests and documentation for configuring/using the E2B environment; add optional e2b dependency extra.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `src/minisweagent/environments/extra/e2b.py` | New E2B sandbox environment + template management and cleanup. |
| `src/minisweagent/environments/__init__.py` | Registers `e2b` in the environment mapping. |
| `src/minisweagent/run/benchmarks/swebench.py` | Injects SWE-bench image for `e2b`; stops environments after each instance. |
| `pyproject.toml` | Adds `e2b` optional dependency extra. |
| `tests/environments/extra/test_e2b.py` | Adds unit tests for naming, execution behavior, serialization, and cleanup. |
| `docs/advanced/environments.md` | Adds `e2b` to the environment list. |
| `docs/reference/environments/e2b.md` | Adds a reference page for the E2B environment. |
| `README.md` | Mentions E2B as a deployable environment option. |

 def get_sb_environment(config: dict, instance: dict) -> Environment:
    env_config = config.setdefault("environment", {})
    env_config["environment_class"] = env_config.get("environment_class", "docker")
    image_name = get_swebench_docker_image_name(instance)
-    if env_config["environment_class"] in ["docker", "swerex_modal"]:
+    if env_config["environment_class"] in ["docker", "swerex_modal", "e2b"]:


+from types import ModuleType
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from minisweagent.environments.extra.e2b import (
+    E2BEnvironment,
+    E2BEnvironmentConfig,
+    E2BTemplateManager,
+)
+from minisweagent.exceptions import Submitted
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_mock_e2b() -> ModuleType:
+    """Return a minimal mock of the `e2b` module."""
+    mock_e2b = MagicMock()
+    mock_e2b.Template = MagicMock()
+    mock_e2b.Sandbox = MagicMock()
+    return mock_e2b


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

klieret · 2026-06-10T22:04:59Z

+                template=template_name,
+                timeout=self.config.sandbox_timeout,
+                api_key=self.config.api_key,
+                metadata={"user": "junyeoplee2"},  # TEMP. DO NOT MERGE


How should this be adapted?

klieret · 2026-06-10T22:05:08Z

+                template=template_name,
+                timeout=self.config.sandbox_timeout,
+                api_key=self.config.api_key,
+                metadata={"user": "junyeoplee2"},  # TEMP. DO NOT MERGE


probably will also need to adapt this

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JunYeopLee · 2026-06-16T12:11:15Z

Hi @klieret Thanks for the review. I just fixed it.

e2b's commands.run() raises CommandExitException (carrying stdout/stderr/exit_code) on any non-zero exit. The generic except branch masked every failing command as an infrastructure error with empty output and returncode -1, hiding real command output from the agent. Detect the exit_code-carrying exception and surface the real result instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
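A rough illustration of the fix described in this commit, written against a stand-in exception rather than the real SDK. `FakeCommandExit`, the `execute` helper, and the `output`/`returncode`/`exception_info` keys are assumptions for this sketch; the only detail taken from the commit message is that the exception carries stdout, stderr, and exit_code.

```python
class FakeCommandExit(Exception):
    """Hypothetical stand-in for an SDK exception that carries the command result."""

    def __init__(self, stdout: str, stderr: str, exit_code: int) -> None:
        super().__init__(f"command exited with code {exit_code}")
        self.stdout = stdout
        self.stderr = stderr
        self.exit_code = exit_code


def execute(run, command: str) -> dict:
    """Call run(command) and normalise the outcome into a result dict."""
    try:
        result = run(command)
    except Exception as exc:
        exit_code = getattr(exc, "exit_code", None)
        if isinstance(exit_code, int):
            # The command ran but failed: surface its real output and exit code.
            output = f"{getattr(exc, 'stdout', '')}{getattr(exc, 'stderr', '')}"
            return {"output": output, "returncode": exit_code}
        # Anything else is treated as an infrastructure error.
        return {"output": "", "returncode": -1, "exception_info": str(exc)}
    return {"output": result, "returncode": 0}
```

The distinction matters to the agent: a failing test run returns its traceback and exit code, while a network or API failure is still reported as `-1`.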

Default/mini configs render {{system}}/{{machine}}/... under Jinja StrictUndefined. E2BEnvironment.get_template_vars omitted these keys (unlike docker/local), crashing those configs at agent startup. Merge platform.uname() like the other environments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
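As a hedged sketch of the merge this commit describes: `platform.uname()` returns a named tuple whose `_asdict()` yields keys such as `system` and `machine`, which can be combined with the environment's own variables. The function signature below is hypothetical and does not mirror the real `get_template_vars`.

```python
import platform


def get_template_vars(config_vars: dict) -> dict:
    """Merge host platform fields with config values (config wins on key clashes)."""
    return {**platform.uname()._asdict(), **config_vars}


template_vars = get_template_vars({"sandbox_timeout": 3600})
```

With the platform keys present, a template that references `{{system}}` or `{{machine}}` no longer fails under a strict undefined policy.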

get_template_vars dumped the full config (api_key, registry credentials) into the Jinja prompt context. Exclude secrets there, and centralize the secret-field set so serialize() also drops registry_username (previously only api_key and registry_password were excluded). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
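One way to centralize the secret-field set, sketched with a plain dataclass. `DemoConfig`, `SECRET_FIELDS`, and `public_config` are hypothetical names, and the real config class may be defined differently; the three field names are the ones listed in the commit message.

```python
from dataclasses import asdict, dataclass

# Single source of truth for fields that must never leave the process.
SECRET_FIELDS = frozenset({"api_key", "registry_username", "registry_password"})


@dataclass
class DemoConfig:
    """Hypothetical stand-in for the environment config."""

    image: str
    api_key: str = ""
    registry_username: str = ""
    registry_password: str = ""


def public_config(config: DemoConfig) -> dict:
    """Return the config as a dict with every secret field removed."""
    return {k: v for k, v in asdict(config).items() if k not in SECRET_FIELDS}
```

Both the prompt context and the serialized trajectory can then call the same helper, so a newly added secret only needs to be listed in one place.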

Rename E2BEnvironment.stop() to cleanup() to match the docker/singularity/bubblewrap convention. The swebench finally block previously called env.stop() guarded by hasattr, which was a silent no-op for the default docker backend (it exposes cleanup()). Call whichever teardown method exists (cleanup or stop) so every backend's per-instance resource is released. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The stale-cache rebuild path tested 'if "404" not in str(e)', which could match an incidental '404' inside a sandbox id or path (triggering an expensive needless rebuild) or miss differently-worded errors. e2b formats API errors as '{status_code}: {message}', so match the leading 404 status code via a small testable helper instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
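A small helper along these lines would match only a leading status code. The name `is_not_found_error` is hypothetical, and the `'{status_code}: {message}'` format is taken from the commit message rather than verified against the SDK.

```python
import re

# Match "404" only at the start of the message, followed by a word boundary.
_LEADING_404 = re.compile(r"^\s*404\b")


def is_not_found_error(exc: BaseException) -> bool:
    """Return True if the error message starts with the 404 status code."""
    return bool(_LEADING_404.match(str(exc)))
```

Anchoring at the start means an incidental "404" inside a sandbox id or path no longer triggers a rebuild.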

env teardown was the last statement of the finally block, so an exception in agent.save() or update_preds_file() would skip it and leak the cloud sandbox/container until its timeout. Move teardown into its own nested finally and extract a _teardown_environment helper so cleanup always runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
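The control flow described here can be sketched with three placeholder callables. `process_instance_sketch` and its parameters are hypothetical and stand in for the agent run, the save/prediction step, and the environment teardown.

```python
def process_instance_sketch(run_agent, save_results, teardown) -> None:
    """Run the agent, then save results, and always tear down the environment."""
    try:
        run_agent()
    finally:
        try:
            # Saving may raise (for example on a disk or serialization error).
            save_results()
        finally:
            # The nested finally guarantees teardown runs even if saving failed.
            teardown()
```

Any exception from the agent run or the save step still propagates to the caller; the nesting only changes what happens before it does.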

codecov · 2026-06-17T02:22:08Z

Codecov Report

❌ Patch coverage is 70.90909% with 48 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/minisweagent/environments/extra/e2b.py	66.66%	48 Missing ⚠️

Files with missing lines	Coverage Δ
src/minisweagent/environments/__init__.py	`100.00% <ø> (ø)`
src/minisweagent/run/benchmarks/swebench.py	`85.62% <100.00%> (+0.18%)`	⬆️
src/minisweagent/environments/extra/e2b.py	`66.66% <66.66%> (ø)`

... and 21 files with indirect coverage changes

klieret · 2026-06-18T22:07:51Z

I just saw that there are some linting issues because e2b can't be imported?

24s
Run # Run pylint with errors-only flag to only fail on real errors (E and F level)
************* Module src.minisweagent.environments.extra.e2b
src/minisweagent/environments/extra/e2b.py:102:8: E0401: Unable to import 'e2b' (import-error)
src/minisweagent/environments/extra/e2b.py:132:8: E0401: Unable to import 'e2b' (import-error)
src/minisweagent/environments/extra/e2b.py:196:8: E0401: Unable to import 'e2b' (import-error)
src/minisweagent/environments/extra/e2b.py:197:8: E0401: Unable to import 'e2b.exceptions' (import-error)

dannyward630 · 2026-06-19T04:45:43Z

Opened a small follow-up against the E2B branch for the pylint import failure: JunYeopLee#1. It adds the new E2B extra to the existing full extra, matching the upstream pylint workflow's .[full] install before linting the E2B module.

JunYeopLee · 2026-06-23T14:25:56Z

Hi @klieret,

Pushed 3 commits addressing the remaining review points:

- Pylint import-error (E0401): added the e2b extra to the full extra in pyproject.toml, so the pylint workflow (pip install .[full]) can now import e2b. Thanks @dannyward630 for the follow-up PR! I applied the same fix directly here.
- Dead code: removed the unused _make_mock_e2b test helper.
- Resource leak: get_sb_environment() now tears down the environment if env_startup_command fails before the caller gets a reference, so the sandbox doesn't leak on startup failure.

Thanks!

JunYeopLee and others added 5 commits March 24, 2026 10:40

[pre-commit.ci] auto fixes from pre-commit.com hooks

db34ac7

for more information, see https://pre-commit.ci

JunYeopLee force-pushed the feat/e2b-environment branch from 9159fb4 to 32d1f91 Compare March 24, 2026 19:55

JunYeopLee and others added 3 commits April 7, 2026 12:29

fix: ensure sandbox cleanup in process_instance finally block

f8b9a05

[pre-commit.ci] auto fixes from pre-commit.com hooks

bec508e

for more information, see https://pre-commit.ci

klieret requested a review from Copilot June 10, 2026 21:53

Copilot started reviewing on behalf of klieret June 10, 2026 21:53 View session

Copilot AI reviewed Jun 10, 2026

klieret and others added 2 commits June 10, 2026 15:00

Move imports top

a07b79e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6718c1c

for more information, see https://pre-commit.ci

klieret reviewed Jun 10, 2026

fix: remove temporary E2B sandbox metadata override

efc3935

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JunYeopLee and others added 3 commits June 16, 2026 14:34

JunYeopLee and others added 4 commits June 16, 2026 14:37

[pre-commit.ci] auto fixes from pre-commit.com hooks

88385c8

for more information, see https://pre-commit.ci

dannyward630 mentioned this pull request Jun 19, 2026

fix(deps): include e2b in full extra JunYeopLee/mini-swe-agent#1

Open

JunYeopLee added 3 commits June 23, 2026 15:23

fix: add e2b to full extra so pylint can import it

21d7e63

test: remove unused _make_mock_e2b helper

7befef8

fix: tear down environment when startup command fails

0eb096f
